Lab 01

Learning objectives

By the end of the lab, you will be able to …

  • set up a reproducible workflow using R and RStudio
  • familiarize yourself with a dataset using R and RStudio
  • create a reproducible report using Quarto

R & RStudio Workflow

Replication

The guiding principle for workflow.

A workflow of data analysis is a process for managing all aspects of data analysis.


Planning, documenting, and organizing your work; cleaning the data; creating, renaming, and verifying variables; performing and presenting statistical analyses; producing replicable results; and archiving what you have done are all integral parts of your workflow.

Steps in a workflow

Set up: Systematic organization of the project and project files.
Familiarize self with data: Skipping takes more time in the long run.
Process data: Takes the MOST time.
Running analyses: What people THINK takes the most time.
Presenting results: What people (wrongly) think does not take time.

File types

There are many file types, but these are key to an R & RStudio workflow (and likely new to you):

Extension                    Description
.Rproj                       RStudio project file (keeps project settings).
.R                           R scripts store a sequence of R commands (code) that can be run all at once or line by line.
.qmd                         Quarto Markdown creates reproducible documents that contain a combination of text, code, and output.
.Rdata (or sometimes .rda)   These store and load R objects, like data frames.
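
To make the .Rdata row concrete, here is a minimal sketch of saving and reloading an R object with base R's save() and load(). The data frame `scores` and the file name are made up for illustration, and the file goes to a temporary folder rather than a real project.

```r
# Hypothetical data frame standing in for a cleaned dataset
scores <- data.frame(id = 1:3, hrs = c(43, 20, 38))

# Save the object to an .Rdata file (a temporary folder is used here)
rdata_path <- file.path(tempdir(), "scores.Rdata")
save(scores, file = rdata_path)

# Remove the object from memory, then restore it from disk
rm(scores)
load(rdata_path)

# The object comes back under its original name
print(scores)
```

Note that load() restores objects under the names they were saved with, so it can overwrite an object of the same name in your environment.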

File names

File names should be:

  • machine-readable
  • human-readable
  • compatible with default ordering
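
One possible way to meet all three criteria is sketched below. The step names and the numbering scheme are assumptions for illustration, not a course requirement.

```r
# Hypothetical steps in an analysis, in the order they should run
steps <- c("load-data", "clean-data", "analyze-data")

# Zero-padded numeric prefixes keep files in run order when sorted;
# lowercase words joined by hyphens are easy for people and code to parse
file_names <- sprintf("%02d_%s.R", seq_along(steps), steps)
print(file_names)

# Without zero-padding, "10" sorts before "2" because sorting is by character
unpadded <- sort(c("2_clean.R", "10_model.R"), method = "radix")
print(unpadded)
```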

RStudio projects


Create an RStudio project for each data analysis project.

It supports an organized and reproducible workflow, cleanly separated from all other projects that you are working on. Everything you need in one place:

  • local data files to load into RStudio
  • scripts to edit or run in bits or as a whole
  • saved outputs (plots and cleaned data)

Filepaths

Adopting a project-based workflow means your file paths do not have to change when the project folder is moved or shared.


ABSOLUTE FILE PATHS

Department of Sociology
Unit 17100, 17th Floor, Ontario Power Building
700 University Ave., Toronto, ON M5G 1Z5

C:\Users\Pepin\GitHub\SOC6302\scripts

RELATIVE FILE PATHS

Take the left side elevators to the 17th floor.
Go through the double doors and take a right.
First door on your left.

here("scripts")
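
The difference can be sketched in code. The file name clean_data.R is only an example, and the last step runs only if the here package is already installed.

```r
# A relative path: directions from the project folder, the same on any computer
relative_path <- file.path("scripts", "clean_data.R")
print(relative_path)

# An absolute path: the full address, which differs from machine to machine
absolute_path <- file.path(getwd(), relative_path)
print(absolute_path)

# With the here package installed, here() builds the path from the project root
if (requireNamespace("here", quietly = TRUE)) {
  print(here::here("scripts", "clean_data.R"))
}
```

The relative version keeps working when the project folder is copied to another computer; the absolute version does not.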

Tour recap: Panes

There are four key regions or “panes” in the interface:

  1. Source pane: where you can edit and save R scripts or author computational documents like Quarto and R Markdown.

  2. Console pane: where you write short interactive R commands.

  3. Environment pane: displays temporary R objects created during that R session.

  4. Output pane: displays the plots, tables, or HTML outputs of executed code along with files saved to disk.

Source Pane

The Source pane is the top-left panel. It opens whenever you open an editable file in RStudio.

R-scripts and Quarto

Open RStudio, click the dropdown arrow next to the “New File” icon, and then choose “R Script” or “Quarto Document.”

Blank slate

Clear the memory at every restart of RStudio by turning off the automatic saving of your workspace and .Rdata files when you quit RStudio. This is important for reproducibility, debugging, and avoiding littering your computer with unnecessary files.

Set this via:

  1. Tools > Global Options.
  2. Uncheck “Restore .RData into Workspace at Startup”.
  3. Choose “Never” on the “Save workspace to .RData on exit”.
  4. Click “Apply” and “OK”.

Comprehensive R Archive Network (CRAN)

CRAN is like an App Store for R. It hosts R packages, documentation, and source code contributed by users worldwide. It is mediated (i.e., submissions are quality controlled), which makes its packages generally reliable.

R users can easily install, update, and share R packages using install.packages().

Packages

R comes with basic tools, but packages extend the capabilities of base R (what you already installed). An R package is like a toolbox: a collection of functions, data, and documentation that help you do specific tasks using R.


You’ll install each package (only once per system):

install.packages("tidyverse")


You’ll load each package (every time you use it):

library(tidyverse)
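
Because installing happens once and loading happens every session, it can help to check what is missing before installing. The sketch below uses a made-up helper name, `find_missing`, and leaves the install step commented out so nothing is downloaded.

```r
# Return the packages in `pkgs` that are not yet installed
# (find_missing is a hypothetical helper, not part of any package)
find_missing <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  unname(pkgs[!installed])
}

# Check the two CRAN packages used in this lab
pkgs <- c("here", "tidyverse")
missing_pkgs <- find_missing(pkgs)
print(missing_pkgs)

# Install only what is missing (commented out so nothing is downloaded here)
# if (length(missing_pkgs) > 0) install.packages(missing_pkgs)
```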

Quarto


The tool you’ll use to create reproducible computational documents. Every assignment you hand in will be a Quarto document.

  • Fully reproducible reports
  • R code + narrative

R script

Great for learning, exploring, and tinkering.

You can rerun it without attention to formatting or markdown.

Quarto

Great for communicating analysis and results.

Combines narrative explanation with code output (results).

Documentation

Tour recap: Quarto

Tour recap: Quarto Code-chunks

  • chunk labels are helpful for describing what the code is doing, for jumping between code cells in the editor, and for troubleshooting
  • message: false hides any messages emitted by the code in your rendered document
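
As a rough sketch, this is what the inside of an R code cell might look like. In a .qmd file the lines below sit inside an {r} code cell, and the options are written as `#|` comments at the top. The label and the data are made up for illustration.

```r
#| label: summarize-hours
#| message: false

# Hypothetical weekly hours for three respondents
hours <- c(43, 20, 38)

# message() output like this is hidden in the rendered document
# because of `message: false`
message("Summarizing ", length(hours), " respondents")

mean_hours <- mean(hours)
print(mean_hours)
```

Because the options are ordinary R comments, the same lines also run unchanged in the Console.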

How will we use Quarto?

  • Every code-along and milestone will be a Quarto document
  • The scaffolding will decrease over the course
  • You will create and submit a Quarto document for your research project

Getting Started

Create an RStudio Project


To create a new project in RStudio, click: File > New Project.

In the New Project wizard that pops up, select: New Directory, then New Project.

Name the project “SOC6302” and click: Create Project.

This will launch you into a new RStudio Project inside a new folder called “SOC6302”.

Your first code-along

Download and open code-along-01.qmd

Packages

We’ll use the following packages:

  • here (relative file paths)
  • tidyverse (data wrangling)
  • gssr (U.S. General Social Survey data)
  • gssrdoc (GSS documentation)

Install here and tidyverse

Let’s install the two packages that are available on CRAN.


Copy and paste the following code into your Console pane. Then press Enter.

install.packages("here")


Then, do the same to install the tidyverse package.

install.packages("tidyverse")

Install gssr and gssrdoc

# Install 'gssr' from the 'kjhealy' R-universe
install.packages('gssr', repos =
  c('https://kjhealy.r-universe.dev', 'https://cloud.r-project.org'))

# Also recommended: install 'gssrdoc' as well
install.packages('gssrdoc', repos =
  c('https://kjhealy.r-universe.dev', 'https://cloud.r-project.org'))

Tip

R ignores text after #. These comments describe what the code does.

Load the packages

library(here)
library(tidyverse)
library(gssr)
library(gssrdoc)

Environment

# software documentation
sessionInfo()
R version 4.5.1 (2025-06-13 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 11 x64 (build 26100)

Matrix products: default
  LAPACK version 3.12.1

locale:
[1] LC_COLLATE=English_Canada.utf8  LC_CTYPE=English_Canada.utf8   
[3] LC_MONETARY=English_Canada.utf8 LC_NUMERIC=C                   
[5] LC_TIME=English_Canada.utf8    

time zone: America/Toronto
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] gssrdoc_0.7.0      here_1.0.1         conflicted_1.2.0   summarytools_1.1.4
 [5] flextable_0.9.6    kableExtra_1.4.0   labelled_2.13.0    haven_2.5.4       
 [9] gssr_0.7           lubridate_1.9.3    forcats_1.0.0      stringr_1.5.1     
[13] dplyr_1.1.4        purrr_1.0.4        readr_2.1.5        tidyr_1.3.1       
[17] tibble_3.2.1       ggplot2_3.5.1      tidyverse_2.0.0   

loaded via a namespace (and not attached):
 [1] tidyselect_1.2.1        viridisLite_0.4.2       fastmap_1.2.0          
 [4] fontquiver_0.2.1        pacman_0.5.1            promises_1.3.3         
 [7] digest_0.6.37           timechange_0.3.0        mime_0.13              
[10] lifecycle_1.0.4         gfonts_0.2.0            magrittr_2.0.3         
[13] compiler_4.5.1          rlang_1.1.6             tools_4.5.1            
[16] utf8_1.2.4              yaml_2.3.10             data.table_1.15.4      
[19] knitr_1.50              askpass_1.2.0           curl_5.2.1             
[22] plyr_1.8.9              xml2_1.3.6              httpcode_0.3.0         
[25] withr_3.0.2             grid_4.5.1              fansi_1.0.6            
[28] gdtools_0.3.7           xtable_1.8-4            colorspace_2.1-0       
[31] scales_1.3.0            MASS_7.3-65             crul_1.4.2             
[34] cli_3.6.5               rmarkdown_2.29          crayon_1.5.3           
[37] ragg_1.3.2              generics_0.1.3          rstudioapi_0.17.1      
[40] reshape2_1.4.4          tzdb_0.4.0              cachem_1.1.0           
[43] pander_0.6.5            matrixStats_1.3.0       base64enc_0.1-3        
[46] vctrs_0.6.5             jsonlite_2.0.0          fontBitstreamVera_0.1.1
[49] hms_1.1.3               rapportools_1.2         systemfonts_1.1.0      
[52] magick_2.8.7            glue_1.8.0              codetools_0.2-20       
[55] stringi_1.8.4           gtable_0.3.5            later_1.4.2            
[58] munsell_0.5.1           pillar_1.9.0            htmltools_0.5.8.1      
[61] openssl_2.2.0           R6_2.6.1                tcltk_4.5.1            
[64] textshaping_0.4.0       rprojroot_2.0.4         evaluate_1.0.4         
[67] shiny_1.11.0            backports_1.5.0         memoise_2.0.1          
[70] fontLiberation_0.1.0    httpuv_1.6.16           pryr_0.1.6             
[73] Rcpp_1.0.14             zip_2.3.1               uuid_1.2-0             
[76] svglite_2.1.3           checkmate_2.3.2         officer_0.6.6          
[79] xfun_0.52               fs_1.6.6                pkgconfig_2.0.3        

Project structure

Let’s set up your project structure using the here package.

here()

First, let’s establish our project directory.

# shows the file path to the root of the project
here()


Next, we’ll create folders within our project.

Example folder structure

Research Projects

project/
  data/
    gss7924-raw.rda
    gss7924-processed.Rdata
  scripts/
    clean_data.R
    analyze_data.R
    draft.qmd
  outputs/
    draft.html
    figures/
      plot1.png
      plot2.png
  readme.qmd
  project.Rproj

SOC6302

SOC6302/
  data/
    gss7924-raw.rda
    gss7924-processed.Rdata
  code-alongs/
  milestones/
  project/
    data/
    scripts/
    outputs/
  readme.qmd
  SOC6302.Rproj

Create a folder structure

using here() and dir.create()

# Create base folders
dir.create(here("data"), recursive = TRUE)
dir.create(here("code-alongs"), recursive = TRUE)
dir.create(here("milestones"), recursive = TRUE)
dir.create(here("project"), recursive = TRUE)

Create sub-folders

using here() and dir.create()

# Create project sub-folders
dir.create(here("project", "data"), recursive = TRUE)
dir.create(here("project", "scripts"), recursive = TRUE)
dir.create(here("project", "outputs"), recursive = TRUE)

Check your work

List the folders and files in the project folder and its sub-folders.

# Your SOC6302 class folder
list.files(path = here())

# Your "Project" sub-folder
list.files(path = here("project"))
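
If you want to rehearse these steps without touching your real project, here is a minimal base R sketch that builds the same folders inside a temporary directory. The root name "SOC6302-demo" is made up, and the dir.exists() check is an added assumption so the code can be rerun without warnings.

```r
# Hypothetical project root in a temporary folder, so no real files are touched
root <- file.path(tempdir(), "SOC6302-demo")

# Folders to create, written relative to the root
folders <- c(
  "data", "code-alongs", "milestones",
  file.path("project", c("data", "scripts", "outputs"))
)

# Create each folder only if it does not already exist
for (folder in file.path(root, folders)) {
  if (!dir.exists(folder)) {
    dir.create(folder, recursive = TRUE)
  }
}

# List everything that was created, including sub-folders
created <- list.files(root, recursive = TRUE, include.dirs = TRUE)
print(created)
```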

Save code-along


Save this code-along in your newly created “code-alongs” sub-folder.


There’s no command in the R console to save scripts or Quarto files. Instead, use the editor’s File > Save As or Ctrl+S.

Meet your data

We’re going to use data from the U.S. General Social Survey (GSS).

Load your data

# Load the data (will appear in your Global Environment pane)
data(gss_all)

# Preview the tibble, which is automatically named gss_all
gss_all
# A tibble: 75,699 × 6,867
   year         id wrkstat    hrs1        hrs2        evwork      occ   prestige
   <dbl+lbl> <dbl> <dbl+lbl>  <dbl+lbl>   <dbl+lbl>   <dbl+lbl>   <dbl> <dbl+lb>
 1 1972          1 1 [workin… NA(i) [iap] NA(i) [iap] NA(i) [iap] 205   50      
 2 1972          2 5 [retire… NA(i) [iap] NA(i) [iap]     1 [yes] 441   45      
 3 1972          3 2 [workin… NA(i) [iap] NA(i) [iap] NA(i) [iap] 270   44      
 4 1972          4 1 [workin… NA(i) [iap] NA(i) [iap] NA(i) [iap]   1   57      
 5 1972          5 7 [keepin… NA(i) [iap] NA(i) [iap]     1 [yes] 385   40      
 6 1972          6 1 [workin… NA(i) [iap] NA(i) [iap] NA(i) [iap] 281   49      
 7 1972          7 1 [workin… NA(i) [iap] NA(i) [iap] NA(i) [iap] 522   41      
 8 1972          8 1 [workin… NA(i) [iap] NA(i) [iap] NA(i) [iap] 314   36      
 9 1972          9 2 [workin… NA(i) [iap] NA(i) [iap] NA(i) [iap] 912   26      
10 1972         10 1 [workin… NA(i) [iap] NA(i) [iap] NA(i) [iap] 984   18      
# ℹ 75,689 more rows
# ℹ 6,859 more variables: wrkslf <dbl+lbl>, wrkgovt <dbl+lbl>,
#   commute <dbl+lbl>, industry <dbl+lbl>, occ80 <dbl+lbl>, prestg80 <dbl+lbl>,
#   indus80 <dbl+lbl>, indus07 <dbl+lbl>, occonet <dbl+lbl>, found <dbl+lbl>,
#   occ10 <dbl+lbl>, occindv <dbl+lbl>, occstatus <dbl+lbl>, occtag <dbl+lbl>,
#   prestg10 <dbl+lbl>, prestg105plus <dbl+lbl>, indus10 <dbl+lbl>,
#   indstatus <dbl+lbl>, indtag <dbl+lbl>, marital <dbl+lbl>, …

Load GSS 2024

# Get the data only for the 2024 survey respondents
gss24 <- gss_get_yr(2024)

# Look at the first 6 rows of the data frame
head(gss24)
# A tibble: 6 × 639
  year        id wrkstat hrs1        hrs2        evwork      marital martype    
  <dbl+lb> <dbl> <dbl+l> <dbl+lbl>   <dbl+lbl>   <dbl+lbl>   <dbl+l> <dbl+lbl>  
1 2024         1 1 [wor…    43       NA(i) [iap] NA(i) [iap] 5 [nev… NA(i) [iap]
2 2024         2 5 [ret… NA(i) [iap] NA(i) [iap]     1 [yes] 5 [nev… NA(i) [iap]
3 2024         3 5 [ret… NA(i) [iap] NA(i) [iap]     1 [yes] 1 [mar…     1 [mar…
4 2024         4 2 [wor…    20       NA(i) [iap] NA(i) [iap] 5 [nev… NA(i) [iap]
5 2024         5 5 [ret… NA(i) [iap] NA(i) [iap]     1 [yes] 3 [div… NA(i) [iap]
6 2024         6 4 [une… NA(i) [iap] NA(i) [iap] NA(i) [iap] 1 [mar…     1 [mar…
# ℹ 631 more variables: divorce <dbl+lbl>, widowed <dbl+lbl>,
#   spwrksta <dbl+lbl>, sphrs1 <dbl+lbl>, sphrs2 <dbl+lbl>, spevwork <dbl+lbl>,
#   cowrksta <dbl+lbl>, coevwork <dbl+lbl>, cohrs1 <dbl+lbl>, cohrs2 <dbl+lbl>,
#   sibs <dbl+lbl>, childs <dbl+lbl>, age <dbl+lbl>, educ <dbl+lbl>,
#   speduc <dbl+lbl>, coeduc <dbl+lbl>, codeg <dbl+lbl>, degree <dbl+lbl>,
#   padeg <dbl+lbl>, madeg <dbl+lbl>, spdeg <dbl+lbl>, sex <dbl+lbl>,
#   race <dbl+lbl>, res16 <dbl+lbl>, reg16 <dbl+lbl>, mobile16 <dbl+lbl>, …

Browse dataframe

With your mouse, go to the Environment pane (upper-right) and click on the “gss24” object. It opens in a viewer where you can browse through it.


This is often a good idea to get a first feel for the data, but only if your dataset is relatively small.
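
For larger datasets, a few base R summaries give a similar first feel without opening the viewer. This is a sketch on a made-up three-row data frame, `toy`, whose column names mimic the GSS output; the same calls should work on gss24.

```r
# Hypothetical mini-dataset; the column names mimic the GSS output above
toy <- data.frame(
  year = c(2024, 2024, 2024),
  id = 1:3,
  hrs1 = c(43, NA, 20)
)

# Number of rows and columns
print(dim(toy))

# Compact overview of each column's type and first values
str(toy)

# Count of missing values per column
print(colSums(is.na(toy)))
```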

Codebook

The GSS documentation is available online in .pdf form.

The .pdfs will be useful for general overviews.


For specific variable information, it will be helpful to use the documentation you’ll load into RStudio.

# Load the codebook
data(gss_dict)

Names

To see the variables available in the dataset, use the names() command.

names(gss_all)


Tip

This command is best to use with smaller datasets.
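
With thousands of variables, one option is to search the names instead of printing them all. The sketch below uses a short made-up vector, `var_names`, holding the first eight variable names from the preview; with the real data you would pass names(gss_all) instead.

```r
# Hypothetical vector standing in for names(gss_all); these eight names
# come from the data preview shown earlier
var_names <- c("year", "id", "wrkstat", "hrs1", "hrs2", "evwork", "occ", "prestige")

# Keep only names that start with "hrs"
hrs_vars <- grep("^hrs", var_names, value = TRUE)
print(hrs_vars)

# Keep names that contain "wrk" or "work" anywhere
work_vars <- grep("wrk|work", var_names, value = TRUE)
print(work_vars)
```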

Variable documentation

For information about a specific GSS variable,
type ?varname at the console.


In the output pane, the Help tab will show the variable documentation.


Tip

Replace “varname” with the name of a variable.
Example: ?meovrwrk

Variable documentation example


meovrwrk {gssrdoc}  R Documentation
Men hurt family when focus on work too much
Description
meovrwrk

Details
Question 1297. And, do you agree or disagree: c. Family life often suffers because men concentrate too much on their work.

Overview
For further details see the official GSS documentation.

Counts by year:

year      iap   agree  can't choose  disagree  neither agree nor disagree  no answer  strongly agree  strongly disagree  skipped on web   Total
1972     1613       -             -         -                           -          -               -                  -               -    1613
1973     1504       -             -         -                           -          -               -                  -               -    1504
1974     1484       -             -         -                           -          -               -                  -               -    1484
1975     1490       -             -         -                           -          -               -                  -               -    1490
1976     1499       -             -         -                           -          -               -                  -               -    1499
1977     1530       -             -         -                           -          -               -                  -               -    1530
1978     1532       -             -         -                           -          -               -                  -               -    1532
1980     1468       -             -         -                           -          -               -                  -               -    1468
1982     1860       -             -         -                           -          -               -                  -               -    1860
1983     1599       -             -         -                           -          -               -                  -               -    1599
1984     1473       -             -         -                           -          -               -                  -               -    1473
1985     1534       -             -         -                           -          -               -                  -               -    1534
1986     1470       -             -         -                           -          -               -                  -               -    1470
1987     1819       -             -         -                           -          -               -                  -               -    1819
1988     1481       -             -         -                           -          -               -                  -               -    1481
1989     1537       -             -         -                           -          -               -                  -               -    1537
1990     1372       -             -         -                           -          -               -                  -               -    1372
1991     1517       -             -         -                           -          -               -                  -               -    1517
1993     1606       -             -         -                           -          -               -                  -               -    1606
1994     1545     695            33       243                         286         27             122                 41               -    2992
1996     1444     825            16       198                         169          1             230                 21               -    2904
1998     2832       -             -         -                           -          -               -                  -               -    2832
2000      940     877            43       361                         331         22             209                 34               -    2817
2002     1857     415             6       264                         108          -              99                 16               -    2765
2004     1906     460             4       188                         135          -              94                 25               -    2812
2006     2518     945            14       477                         304          1             208                 43               -    4510
2008      694     653            12       310                         161          -             143                 50               -    2023
2010      614     662             6       388                         192          3             122                 57               -    2044
2012      672     558            11       382                         170          -             130                 51               -    1974
2014      863     702             7       479                         234          1             176                 76               -    2538
2016      979     819             9       536                         257          -             171                 96               -    2867
2018      789     644            11       475                         220          2             134                 73               -    2348
2021     1315     886             1       487                        1001          -             202                138               2    4032
2022     1168     885            15       537                         618          1             201                117               2    3544
2024     1126     787            19       481                         611          -             195                 89               1    3309
Total   50650   10813           207      5806                        4797         58            2436                927               5   75699
Values
1 strongly agree

2 agree

3 neither agree nor disagree

4 disagree

5 strongly disagree

NA(d) can't choose

NA(i) iap

NA(j) I don't have a job

NA(m) dk, na, iap

NA(n) no answer

NA(p) not imputable

NA(r) refused

NA(s) skipped on web

NA(u) uncodeable

NA(x) not available in this release

NA(y) not available in this year

NA(z) see codebook

Source
General Social Survey https://gss.norc.org

[Package gssrdoc version 0.7.0 Index]
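
As a worked example of reading the counts, the sketch below computes the share of 2024 respondents who agree or strongly agree. Treating only the five substantive categories as valid responses is an assumption made here for illustration; the counts are copied from the 2024 row.

```r
# Counts copied from the 2024 row of the counts-by-year table
counts_2024 <- c(
  "strongly agree" = 195,
  "agree" = 787,
  "neither agree nor disagree" = 611,
  "disagree" = 481,
  "strongly disagree" = 89
)

# Valid responses: the five substantive categories only (an assumption;
# iap, can't choose, and skipped on web are treated as missing)
valid_n <- sum(counts_2024)

# Share who agree or strongly agree among valid responses
share_agree <- sum(counts_2024[c("strongly agree", "agree")]) / valid_n
print(valid_n)
print(round(share_agree, 3))
```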

We can find which years one or more variables were asked with the gss_which_years() function.

gss_which_years(gss_all, meovrwrk)
# A tibble: 35 × 2
   year      meovrwrk
   <dbl+lbl> <lgl>   
 1 1972      FALSE   
 2 1973      FALSE   
 3 1974      FALSE   
 4 1975      FALSE   
 5 1976      FALSE   
 6 1977      FALSE   
 7 1978      FALSE   
 8 1980      FALSE   
 9 1982      FALSE   
10 1983      FALSE   
# ℹ 25 more rows

Tip

To see all rows when running this in the console, wrap the code in the print() command: print(gss_which_years(gss_all, meovrwrk), n = 40)
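
To see the idea behind this kind of check, here is a rough base R sketch on a made-up five-row data frame. It is not how gssr implements gss_which_years(); it only illustrates asking, year by year, whether a variable has any non-missing values.

```r
# Hypothetical mini-dataset: NA means the question was not asked or not answered
toy <- data.frame(
  year = c(1972, 1972, 1994, 1994, 2024),
  meovrwrk = c(NA, NA, 2, NA, 4)
)

# For each year, is there at least one non-missing response?
asked <- tapply(toy$meovrwrk, toy$year, function(x) any(!is.na(x)))
print(asked)

# Years in which the variable has data
years_asked <- as.numeric(names(asked)[asked])
print(years_asked)
```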